Feat/bias evaluator WinoBias (Gender) #83
chaitanyamedidar wants to merge 5 commits into AOSSIE-Org:main
Conversation
Walkthrough
Added an evaluation framework: a PerplexityEvaluator and a WinoBiasEvaluator.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Evaluator as PerplexityEvaluator
    participant Dataset as HuggingFace Dataset
    participant Tokenizer
    participant Model
    User->>Evaluator: evaluate(model, tokenizer)
    Evaluator->>Dataset: load benchmark "test" split (stream)
    Dataset-->>Evaluator: rows
    loop per sample (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Model: request logits (teacher-forced steps)
        Model-->>Evaluator: logits
        Evaluator->>Evaluator: compute log-softmax, NLL → sentence_ppl
    end
    Evaluator->>Evaluator: aggregate mean perplexity
    Evaluator-->>User: {perplexity: float}
```
```mermaid
sequenceDiagram
    actor User
    participant Evaluator as WinoBiasEvaluator
    participant Dataset as HuggingFace Dataset
    participant Tokenizer
    participant Perplexity as PerplexityEvaluator
    participant Model
    User->>Evaluator: evaluate(model, tokenizer)
    Evaluator->>Dataset: load_dataset("wino_bias", "type1_pro")
    Dataset-->>Evaluator: pro rows
    loop pro samples (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Perplexity: compute_sentence_perplexity(model, token_ids)
        Perplexity->>Model: request logits
        Model-->>Perplexity: logits
        Perplexity-->>Evaluator: perplexity score
    end
    Evaluator->>Evaluator: compute stereotype_score (mean)
    Evaluator->>Dataset: load_dataset("wino_bias", "type1_anti")
    Dataset-->>Evaluator: anti rows
    loop anti samples (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Perplexity: compute_sentence_perplexity(model, token_ids)
        Perplexity->>Model: request logits
        Model-->>Perplexity: logits
        Perplexity-->>Evaluator: perplexity score
    end
    Evaluator->>Evaluator: compute anti_stereotype_score (mean)
    Evaluator->>Evaluator: bias_score = abs(stereotype - anti_stereotype)
    Evaluator-->>User: {stereotype_score, anti_stereotype_score, bias_score}
```
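The flow above reduces to a simple aggregation once per-sentence perplexities are in hand. A minimal sketch of that aggregation (the helper name `wino_bias_score` and the plain-list inputs are illustrative, not the PR's actual API):

```python
def wino_bias_score(pro_ppls: list[float], anti_ppls: list[float]) -> dict:
    """Aggregate per-sentence perplexities into the three reported scores."""
    # Mean perplexity over each split; inf fallback mirrors the empty-split case.
    stereotype = sum(pro_ppls) / len(pro_ppls) if pro_ppls else float("inf")
    anti = sum(anti_ppls) / len(anti_ppls) if anti_ppls else float("inf")
    return {
        "stereotype_score": stereotype,
        "anti_stereotype_score": anti,
        # Absolute gap: 0 means the model finds pro- and anti-stereotypical
        # sentences equally (un)surprising.
        "bias_score": abs(stereotype - anti),
    }
```

A gap of zero indicates no measurable preference between the paired splits; larger gaps indicate stronger gender bias.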
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/bias/wino_bias.py`:
- Around lines 83-86: bias_score is computed from the stereotype_score and anti_stereotype_score returned by _score_split; when both are infinite, the subtraction yields NaN. After computing the two scores, detect the both-infinite case (e.g., via math.isinf) and set bias_score to a stable sentinel such as float("inf") instead of subtracting; keep the existing abs(stereotype_score - anti_stereotype_score) behavior for finite values.
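A minimal sketch of the suggested guard (the function name and signature are illustrative; in the PR the logic would live inline in `evaluate()`):

```python
import math

def combine_bias_scores(stereotype_score: float, anti_stereotype_score: float) -> float:
    """Absolute difference of split scores, with a stable sentinel for inf - inf."""
    if math.isinf(stereotype_score) and math.isinf(anti_stereotype_score):
        # inf - inf would be NaN; report inf to signal that both splits failed.
        return float("inf")
    return abs(stereotype_score - anti_stereotype_score)
```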
In `@openverifiablellm/eval/perplexity.py`:
- Around lines 29-43: The stride parameter is stored in __init__ but never used. Update the sequence-processing logic (the method that prepares token windows or computes perplexity; look for methods like evaluate, score, or compute_perplexity) to respect self.stride when a text exceeds the model context length: use sliding windows whose start advances by self.stride rather than a single truncation, ensure tokenization/truncation operates on these windows, keep the docstring and self.stride assignment in sync, and add a test verifying that long sequences are segmented with the configured stride.
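One way to honor the stride, sketched with a standalone helper (`window_token_ids` is a hypothetical name; the PR would wire something like this into the evaluator):

```python
def window_token_ids(token_ids: list[int], max_length: int, stride: int) -> list[list[int]]:
    """Split a long token sequence into windows of up to max_length tokens.

    The window start advances by `stride` each step, so consecutive windows
    overlap by (max_length - stride) tokens.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    if len(token_ids) <= max_length:
        return [token_ids]  # short sequences need no windowing
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows
```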
- Around lines 141-149: The evaluation loop in openverifiablellm.eval.perplexity checks self.n_samples before filtering empty texts, so blank rows consume the quota. Inside the for-loop (where tokenizer.encode, compute_sentence_perplexity, and scores.append are used), skip empty texts first, then check whether the evaluated sample count (len(scores), or a dedicated counter) has reached self.n_samples and break. This ensures compute_sentence_perplexity is only called for non-empty rows and the n_samples limit counts evaluated samples.
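The suggested reordering can be sketched as follows (`collect_scores`, the `"text"` field, and the callback parameter are illustrative; the actual loop lives in the evaluator's `evaluate` method):

```python
def collect_scores(rows, n_samples, score_fn):
    """Score up to n_samples non-empty rows; blank rows never consume quota."""
    scores = []
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # skip empties before checking the sample budget
        if len(scores) >= n_samples:
            break  # the quota counts evaluated samples only
        scores.append(score_fn(text))
    return scores
```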
- Around lines 103-114: zip(logits_batch, targets) silently truncates when the model output length is wrong. Before the loop computing nll_sum, validate that len(logits_batch) == len(targets) and raise a clear exception (e.g., ValueError) reporting both lengths if they differ; optionally also validate that each logits row covers max(target) or matches the expected vocab_size before computing log-probs. Only then proceed to the existing loop and return math.exp(nll_sum / len(targets)).
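A dependency-free sketch of the validated scoring path (plain lists of floats stand in for tensors, and the function name is illustrative):

```python
import math

def sentence_perplexity(logits_batch: list[list[float]], targets: list[int]) -> float:
    """Per-token NLL via a numerically stable log-softmax, with explicit
    length checks so a wrong-length model output raises instead of being
    silently truncated by zip."""
    if len(logits_batch) != len(targets):
        raise ValueError(
            f"logits/targets length mismatch: {len(logits_batch)} vs {len(targets)}"
        )
    nll_sum = 0.0
    for logits, target in zip(logits_batch, targets):
        if not 0 <= target < len(logits):
            raise ValueError(f"target {target} out of range for vocab {len(logits)}")
        m = max(logits)  # max-shift for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        nll_sum += log_z - logits[target]  # -log p(target)
    return math.exp(nll_sum / len(targets))
```

With uniform two-way logits, every step costs ln 2 nats, so the perplexity is exactly 2.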
In `@pyproject.toml`:
- Around lines 14-19: The dependencies list in pyproject.toml is missing core LLM packages. Add "numpy" and "torch" to the existing dependencies array (alongside "datasets", "defusedxml", "sentencepiece", "tokenizers==0.15.2") so they are installed as required dependencies, pinning versions if necessary for compatibility.
In `@tests/test_eval.py`:
- Around lines 48-52: The test helper _load currently returns anti_data for any unexpected input, which can mask integration bugs. Make the mock _load strictly validate its inputs (check the incoming name, config, and split against the expected values) and raise an explicit exception (e.g., ValueError or AssertionError) when an unexpected name/config/split is passed instead of defaulting to anti_data; adjust the pro_data/anti_data references accordingly and update any tests (e.g., callers of evaluate) that relied on the fallback behavior.
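A strict mock along these lines (`make_strict_load` and its signature are illustrative stand-ins for the test's `_load` patch of HuggingFace `load_dataset`):

```python
def make_strict_load(pro_data, anti_data):
    """Build a load_dataset stand-in that fails loudly on unexpected args."""
    def _load(name, config=None, split=None, **kwargs):
        if name != "wino_bias":
            raise ValueError(f"unexpected dataset name: {name!r}")
        if config == "type1_pro":
            return pro_data
        if config == "type1_anti":
            return anti_data
        # No silent fallback: an unexpected config is a test failure.
        raise ValueError(f"unexpected config: {config!r}")
    return _load
```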
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9bab766c-34c7-479f-b7cb-b51e0c46b3bd
📒 Files selected for processing (7)
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/bias/__init__.py
- openverifiablellm/eval/bias/wino_bias.py
- openverifiablellm/eval/perplexity.py
- pyproject.toml
- tests/test_eval.py
```toml
dependencies = [
    "datasets",
    "defusedxml",
    "sentencepiece",
    "tokenizers==0.15.2"
]
```
🧩 Analysis chain

🏁 Script executed:

```shell
find . -name "pyproject.toml" -type f
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 88

🏁 Script executed:

```shell
cat -n ./pyproject.toml
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1676
Add numpy and torch to required dependencies.
These are core LLM dependencies for this project and should be in the required dependency list, not optional.
Suggested patch

```diff
 dependencies = [
     "datasets",
+    "numpy",
+    "torch",
     "defusedxml",
     "sentencepiece",
     "tokenizers==0.15.2"
 ]
```
Actionable comments posted: 1
♻️ Duplicate comments (1)

openverifiablellm/eval/bias/wino_bias.py (1)

83-83: 🧹 Nitpick | 🔵 Trivial: Move `import math` to module level.

The `math` import is placed mid-function, after the nested `_score_split` definition. Standard practice is to place imports at module level for clarity; Python caches imports so there is no repeated runtime cost, but the style is unconventional.

♻️ Suggested fix: move the import to the top of the file:

```diff
 from typing import Optional
+import math
 from ..base import BaseEvaluator
```

Then remove the in-function import at line 83:

```diff
     return float(sum(scores) / len(scores)) if scores else float("inf")
-    import math
     stereotype_score = _score_split(pro_ds)
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/bias/wino_bias.py`:
- Around lines 78-81: The averaging over scores collected via PerplexityEvaluator.compute_sentence_perplexity(model, token_ids) can produce float("inf") for the whole split if any single sentence returns infinity. Filter out non-finite values (math.isfinite) before computing the mean, and return float("inf") (the original fallback) only when no finite scores remain, so one infinite sentence does not make the entire split score infinite.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 97aac695-5da9-424b-8626-283d689db63b
📒 Files selected for processing (1)
openverifiablellm/eval/bias/wino_bias.py
```python
        scores.append(
            PerplexityEvaluator.compute_sentence_perplexity(model, token_ids)
        )
    return float(sum(scores) / len(scores)) if scores else float("inf")
```
Consider filtering inf values before computing the mean.
If compute_sentence_perplexity returns float("inf") for any sentence (e.g., sequences with < 2 tokens), the entire split score becomes inf since sum([..., inf, ...]) is inf. While WinoBias sentences are typically well-formed, malformed or edge-case entries could skew the entire evaluation.
🛡️ Suggested defensive approach

```diff
-    return float(sum(scores) / len(scores)) if scores else float("inf")
+    finite_scores = [s for s in scores if math.isfinite(s)]
+    return float(sum(finite_scores) / len(finite_scores)) if finite_scores else float("inf")
```

This filters out infinite values, computing the mean only over valid perplexity scores.
Addressed Issues:
Implement WinoBias Gender Bias Evaluator: this PR directly implements the bias-testing evaluation metric listed as a success criterion in the project specification.
Screenshots/Recordings:
Additional Notes:
The project motivation explicitly states that LLM providers have growing incentive to bias models in favour of their sponsors and advertisers.
Without a concrete measurement tool, the claim of an impartial, unbiased model cannot be verified. This PR implements the first piece of that measurement suite.
Structure:
Why WinoBias:
Gender bias is one of the most well-documented forms of systematic skew that emerges from biased training data, which is exactly the problem this project addresses. WinoBias is publicly available on HuggingFace, consistent with the project's open-data philosophy, and provides clean paired sentence comparisons, making bias measurement interpretable.
Why this structure (eval/bias/ subpackage):
Each bias benchmark is its own independent class in its own file, so future benchmarks such as TruthfulQA (factual bias) or PoliEval (political bias) can be added and reviewed as separate PRs without merge conflicts, and any benchmark can be merged independently.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
- New Features
- Dependencies
- Tests